Abstract

Dimensionality reduction is a topic of recent interest. In this paper, we present the classification
constrained dimensionality reduction (CCDR) algorithm to account for label information. The algorithm
can account for multiple classes as well as the semi-supervised setting. We present out-of-sample
expressions for both labeled and unlabeled data. For unlabeled data, we introduce a method of embedding
a new point as preprocessing to a classifier. For labeled data, we introduce a method that improves the
embedding during the training phase using the out-of-sample extension. We investigate classification
performance using the CCDR algorithm on hyper-spectral satellite imagery data. We demonstrate
the performance gain for both local and global classifiers and demonstrate a 10% improvement of
the k-nearest neighbors algorithm performance. We present a connection between intrinsic dimension
estimation and the optimal embedding dimension obtained using the CCDR algorithm.

Classification Constrained Dimensionality Reduction

I. Introduction 

In classification theory, the main goal is to find a mapping from an observation space $\mathcal{X}$,
consisting of a collection of points in some containing Euclidean space $\mathbb{R}^d$, $d > 1$, into a set
consisting of several different integer-valued hypotheses. In some problems, the observations from
the set $\mathcal{X}$ lie on a $d$-dimensional manifold $\mathcal{M}$, and Whitney's theorem tells us that, provided
this manifold is smooth enough, there exists an embedding of $\mathcal{M}$ into $\mathbb{R}^{2d+1}$. This motivates the
approach taken by kernel methods in classification theory, such as support vector machines [1].
Our interest is in finding an embedding of $\mathcal{M}$ into a lower-dimensional Euclidean space.

Fig. 1. PCA of a two-class classification problem.

Dimensionality reduction of high-dimensional data was addressed in classical methods such
as principal component analysis (PCA) [2] and multidimensional scaling (MDS) [3], [4]. In PCA,
an eigendecomposition of the $d \times d$ empirical covariance matrix is performed and the data points
are linearly projected along the $m$ eigenvectors ($0 < m < d$) with the largest eigenvalues. A problem
that may occur with PCA for classification is demonstrated in Fig. 1. When the information
that is relevant for classification is present only in the eigenvectors associated with the small
eigenvalues ($e_2$ in the figure), removal of such eigenvectors may result in severe degradation
in classification performance. In MDS, the goal is to find a lower-dimensional embedding of
the original data points that preserves the relative distances between all the data points. Both
methods suffer greatly when the manifold is nonlinear. For example, PCA will not be
able to offer dimensionality reduction for classification of two classes lying each on one of two
concentric circles.

In [5], a nonlinear extension to PCA is presented. The algorithm is based on the "kernel trick" 
[6]. Data points are nonlinearly mapped into a feature space, which in general has a higher (or 
even infinite) dimension as compared with the original space and then PCA is applied to the 
high dimensional data. 

In the paper of Tenenbaum et al. [7], Isomap, a global dimensionality reduction algorithm,
was introduced, taking into account the fact that data points may lie on a lower-dimensional
manifold. Unlike MDS, geodesic distances (distances that are measured along the manifold)
are preserved by Isomap. Isomap utilizes the classical MDS algorithm, but instead of using the
matrix of Euclidean distances, it uses a modified version of it. Each point is connected only
to points in its local neighborhood. A distance between a point and another point outside its
local neighborhood is replaced with the sum of distances along the shortest path in the graph. This
procedure modifies the squared-distance matrix, replacing Euclidean with geodesic distances.

In [8], Belkin and Niyogi present a related Laplacian eigenmap dimensionality reduction
algorithm. The algorithm performs a minimization on the weighted sum of squared distances
of the lower-dimensional data. Each weight multiplying the squared distance of two
low-dimensional data points is inversely related to the distance between the corresponding two
high-dimensional data points. Therefore, a small distance between two high-dimensional data points
results in a small distance between two low-dimensional data points. To preserve the geodesic
distances, the weight of the distance between two points that do not share a local neighborhood
is set to zero.

We refer the interested reader to the references below and those cited therein for a list of
some of the most commonly used additional algorithms within the class of manifold learning
algorithms and their different advantages relevant to our work: Locally Linear Embedding (LLE)
[9], Laplacian Eigenmaps [8], Hessian Eigenmaps (HLLE) [10], Local Tangent Space Alignment
[11], Diffusion Maps [12], and Semidefinite Embedding (SDE) [13].

The algorithms mentioned above consider the problem of learning a lower-dimensional
embedding of the data. In classification, such algorithms can be used to preprocess high-dimensional
data before performing the classification. This could potentially allow for a lower computational
complexity of the classifier. In some cases, however, dimensionality reduction increases the
computational complexity of the classifier. In fact, support vector machines suggest the opposing strategy:
data points are projected onto a higher-dimensional space and classified by a low computational
complexity classifier. To guarantee a low computational complexity of the classifier of the
low-dimensional data, a classification constrained dimensionality reduction (CCDR) algorithm was
introduced in [14]. The CCDR algorithm is an extension of Laplacian eigenmaps [8] and it
incorporates class label information into the cost function, reducing the distance between points
with the same label. Another algorithm that incorporates label information is marginal Fisher
analysis (MFA) [15], in which a constraint on the margin between classes is used to enforce
class separation.

In [14], the CCDR algorithm was only studied for two classes and its performance was
illustrated for simulated data. In [16], a multi-class extension to the problem was presented. In this
paper, we introduce two additional components that make the algorithm computationally viable.
The first is an out-of-sample extension for classification of unlabeled test points. Similarly to the
out-of-sample extension presented in [17], one can utilize the Nyström formula for classification
problems in which label information is available. We study the algorithm performance as its
various parameters (e.g., dimension, label importance, and local neighborhood) are varied. We
study the performance of CCDR as preprocessing prior to implementation of several
classification algorithms such as k-nearest neighbors, linear classification, and neural networks. We
demonstrate a 10% improvement over the k-nearest neighbors algorithm performance benchmark
for this dataset. We address the issue of dimension estimation and its effect on classification
performance.

The organization of this paper is as follows. Section III presents the multiple-class CCDR
algorithm. Section ?? provides a study of the algorithm using the Landsat dataset and Section
VI summarizes our results.

II. Dimensionality Reduction 

Let $\mathcal{X}_n = \{x_1, x_2, \ldots, x_n\}$ be a set of $n$ points constrained to lie on an $m$-dimensional
submanifold $\mathcal{M} \subset \mathbb{R}^d$. In dimensionality reduction, our goal is to obtain a lower-dimensional
embedding $\mathcal{Y}_n = \{y_1, y_2, \ldots, y_n\}$ (where $y_i \in \mathbb{R}^m$ with $m < d$) that preserves local geometry
information such that processing of the lower-dimensional embedding $\mathcal{Y}_n$ yields comparable
performance to processing of the original data points $\mathcal{X}_n$. Alternatively, we would like to learn
the mapping $f : \mathcal{M} \subset \mathbb{R}^d \to \mathbb{R}^m$ that maps every data point $x_i$ to $y_i = f(x_i)$ such that
some geometric properties of the high-dimensional data are preserved in the lower-dimensional
embedding. The first question that comes to mind is how to select $f$, or more specifically how
to restrict the function $f$ so that we can still achieve our goal.

A. Linear dimensionality reduction 

1) PCA: When principal component analysis (PCA) is used for dimensionality reduction, one
considers a linear embedding of the form

$$y_i = f(x_i) = A x_i,$$

where $A$ is $m \times d$. This embedding captures the notion of proximity in the sense that close
points in the high-dimensional space map to close points in the lower-dimensional embedding,
i.e., $\|y_i - y_j\| = \|A(x_i - x_j)\| \le \|A\| \, \|x_i - x_j\|$. Let

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

denote the sample mean, and let $C_x$ and $C_y$ denote the empirical covariance matrices of the
original and the embedded points, respectively. Since $y_i = A x_i$, we have $\bar{y} = A\bar{x}$ and
$C_y = A C_x A^T$. In PCA, the goal is to find the projection
matrix $A$ that preserves most of the energy in the original data by solving

$$\max_A \operatorname{tr}\{C_y(A)\} \quad \text{s.t.} \quad A A^T = I,$$

which is equivalent to

$$\max_A \operatorname{tr}\{A C_x A^T\} \quad \text{s.t.} \quad A A^T = I. \qquad (1)$$

The solution to (1) is given by $A = [u_1, u_2, \ldots, u_m]^T$, where $u_i$ is the eigenvector of $C_x$
corresponding to its $i$th largest eigenvalue. When the data lies on an $m$-dimensional hyperplane,
the matrix $C_x$ has only $m$ positive eigenvalues and the rest are zero. Furthermore, every $x_i$
belongs to $\bar{x} + \operatorname{span}\{u_1, u_2, \ldots, u_m\} \subset \mathbb{R}^d$. In this case, the mapping that PCA finds, $f(x) = Ax$,
is one-to-one and satisfies $\|f(x_i) - f(x_j)\| = \|A(x_i - x_j)\| = \|x_i - x_j\|$. Therefore, the
lower-dimensional embedding preserves all the geometry information in the original dataset $\mathcal{X}$. We would like to
point out that PCA can be written as

$$\max_A \sum_{i,j=1}^{n} \|y_i - y_j\|^2 \quad \text{s.t.} \quad y_i = A x_i \ \text{and} \ A A^T = I.$$
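The hyperplane property above can be checked numerically. The following is a minimal sketch, not part of the original algorithm description: it assumes the $1/n$ normalization for $C_x$, stores one data point per column as in the text, and uses the hypothetical helper names `pca_projection` and `pairwise_dists`.

```python
import numpy as np

def pca_projection(X, m):
    """Return the m x d matrix A of eq. (1): rows are the top-m eigenvectors of C_x.

    X is d x n (one data point per column). The 1/n normalization is an assumption.
    """
    x_bar = X.mean(axis=1, keepdims=True)
    Xc = X - x_bar
    C_x = Xc @ Xc.T / X.shape[1]
    eigvals, eigvecs = np.linalg.eigh(C_x)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:m]           # indices of the m largest
    return eigvecs[:, order].T, eigvals[order]

def pairwise_dists(P):
    """Euclidean distances between all columns of P."""
    diff = P[:, :, None] - P[:, None, :]
    return np.sqrt((diff ** 2).sum(axis=0))

# Synthetic data lying on an m-dimensional affine hyperplane in R^d.
rng = np.random.default_rng(0)
d, m, n = 5, 2, 50
basis, _ = np.linalg.qr(rng.standard_normal((d, m)))   # orthonormal basis of the plane
X = basis @ rng.standard_normal((m, n)) + rng.standard_normal((d, 1))

A, top_eigvals = pca_projection(X, m)
Y = A @ X
# Largest change in any pairwise distance caused by the projection.
max_distortion = np.abs(pairwise_dists(X) - pairwise_dists(Y)).max()
```

For data generated this way, `max_distortion` is at the level of floating-point round-off, which is the isometry claim made in the text for data on a hyperplane.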

2) MDS: Multidimensional scaling (MDS) differs from PCA in the way the input is provided
to it. While in PCA the original data $X$ is provided, classical MDS requires only the set
of all Euclidean pairwise distances $\{\|x_i - x_j\|_2\}_{i,j}$. As MDS uses only pairwise distances,
the solution it finds is given up to translation and unitary transformation. Let $x_i' = x_i - c$; the
Euclidean distance $\|x_i' - x_j'\|$ is the same as $\|x_i - x_j\|$. Let $U$ be an arbitrary unitary matrix
satisfying $U^T U = I$ and define $x_i' = U x_i$. The distance $\|x_i' - x_j'\|$ is equal to $\|U(x_i - x_j)\|$,
which by the invariance of the Euclidean norm to a unitary transformation equals $\|x_i - x_j\|$.
Denote the pairwise squared-distance matrix by $[D_2]_{ij} = \|x_i - x_j\|^2$. By the definition of
Euclidean distance, the matrix $D_2$ satisfies

$$D_2 = \mathbf{1}\phi^T + \phi\mathbf{1}^T - 2 X^T X, \qquad (2)$$

where $X = [x_1, x_2, \ldots, x_n]$ and $\phi = [\|x_1\|^2, \|x_2\|^2, \ldots, \|x_n\|^2]^T$. To verify (2), one can
examine the $ij$-th term of $D_2$ and compare with $\|x_i - x_j\|^2$. Denote the $n \times n$ matrix
$H = I - \mathbf{1}\mathbf{1}^T/n$. Multiplying $D_2$ by $H$ on both sides, in addition to a factor of $-\frac{1}{2}$, yields

$$-\tfrac{1}{2} H D_2 H = (XH)^T (XH),$$

since $H\mathbf{1} = 0$. This identity is key to MDS, i.e., Cholesky decomposition of $-\frac{1}{2} H D_2 H$ yields $X$ to within a
translation and a unitary transformation. Consider the eigendecomposition $-\frac{1}{2} H D_2 H = U \Lambda U^T$.
Therefore, a rank-$d$ $X$ can be obtained as $X = \Lambda_d^{1/2} U_d^T$, where $\Lambda_d^{1/2} = \operatorname{diag}\{[\lambda_1, \lambda_2, \ldots, \lambda_d]\}^{1/2}$
and $U_d = [u_1, u_2, \ldots, u_d]$. Note that $XH$ is a translated version of $X$, in which every column
$x_i$ is translated to $x_i - \bar{x}$.

To use MDS for dimensionality reduction, we can consider a two-step process. First, a
squared-distance matrix $D_2$ is obtained from the high-dimensional data $X$. Then, MDS is applied to $D_2$
to obtain a low-dimensional ($m < d$) embedding by $Y = \Lambda_m^{1/2} U_m^T$ (for centered $X$ this
coincides, up to a unitary transformation, with $X U_m U_m^T$). In the absence
of noise, this procedure provides an affine transformation to the high-dimensional data and thus
can be regarded as a linear method.
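The two-step procedure can be sketched as follows. This is an illustrative implementation under stated assumptions rather than the authors' code: the helper name `classical_mds` is hypothetical, and small negative eigenvalues caused by round-off are clipped to zero.

```python
import numpy as np

def classical_mds(D2, m):
    """Classical MDS: return an m x n embedding from an n x n squared-distance matrix."""
    n = D2.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n             # centering matrix H = I - 11^T/n
    B = -0.5 * H @ D2 @ H                           # equals (XH)^T (XH) for Euclidean D2
    eigvals, eigvecs = np.linalg.eigh(B)            # ascending order
    order = np.argsort(eigvals)[::-1][:m]
    lam = np.clip(eigvals[order], 0.0, None)        # guard against round-off negatives
    return np.sqrt(lam)[:, None] * eigvecs[:, order].T   # Lambda_m^{1/2} U_m^T

def squared_distances(X):
    """Eq. (2): D2 = 1 phi^T + phi 1^T - 2 X^T X for a d x n data matrix X."""
    phi = (X ** 2).sum(axis=0)
    return phi[None, :] + phi[:, None] - 2.0 * X.T @ X

rng = np.random.default_rng(1)
X = rng.standard_normal((3, 20))                    # 20 points in R^3
D2 = squared_distances(X)
Y = classical_mds(D2, 3)                            # full-rank recovery (m = d)
D2_recovered = squared_distances(Y)
```

With $m = d$ the recovered squared distances match the input up to round-off, which illustrates that the solution is determined only up to translation and a unitary transformation.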

B. Nonlinear dimensionality reduction 

Linear maps are limited as they cannot preserve the geometry of nonlinear manifolds. 

1) Kernel PCA: Kernel PCA is one of the first methods in dimensionality reduction of data
on nonlinear manifolds. The method combines the dimensionality reduction capabilities of PCA
on linear manifolds with a nonlinear embedding of data points in a higher (or even infinite)
dimensional space using the "kernel trick" [6]. In PCA, one finds the eigenvectors satisfying
$C_x v_k = \lambda_k v_k$. Since $v_k$ can be written as a linear combination of the $x_i$'s, $v_k = \sum_i \alpha_{ki}(x_i - \bar{x})$, one
can replace $v_k$ in the eigendecomposition, simplify, and obtain $X(K\alpha_k - \lambda_k \alpha_k) = 0$, where
$K_{ij} = (x_i - \bar{x})^T (x_j - \bar{x})$. Consider the mapping $\phi : \mathcal{M} \to \mathcal{H}$ from the manifold to a Hilbert
space. The "kernel trick" suggests replacing $x_i$ with $\phi(x_i)$ and therefore rewriting the kernel
as $K_{ij} = \phi(x_i)^T \phi(x_j)$. Further generalization can be made by setting $K_{ij} = K(x_i, x_j)$, where
$K(\cdot, \cdot)$ is positive semidefinite. The resulting vectors are of the form $v_k = \sum_i \alpha_{ki} \phi(x_i)$,
thus implementing a nonlinear embedding into a nonlinear manifold.

2) ISOMAP: In [7], Tenenbaum et al. find a nonlinear embedding that, rather than preserving
the Euclidean distance between points on a manifold, preserves the geodesic distance between
points on the manifold. Similar to MDS, where a lower-dimensional embedding is found to
preserve the Euclidean distances of high-dimensional data, ISOMAP finds a lower-dimensional
embedding that preserves the geodesic distances between high-dimensional data points.

3) Laplacian Eigenmaps: Belkin and Niyogi's Laplacian eigenmaps dimensionality reduction
algorithm [8] takes a different approach. They consider a nonlinear mapping $f$ that minimizes
the Laplacian

$$\int_{\mathcal{M}} \|\nabla f(x)\|^2 \, dx. \qquad (3)$$

Since the manifold is not available but only data points on it are, the lower-dimensional embedding
is found by minimizing the graph Laplacian given by

$$\sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} \, \|y_i - y_j\|^2, \qquad (4)$$

where $w_{ij}$ is the $ij$th element of the adjacency matrix $W$, which is constructed as follows: For
$k \in \mathbb{N}$, a $k$-nearest neighbors graph is constructed with the points in $\mathcal{X}_n$ as the graph vertices.
Each point $x_i$ is connected to its $k$ nearest neighboring points. Note that it suffices that either
$x_i$ is among $x_j$'s $k$ nearest neighbors or $x_j$ is among $x_i$'s $k$ nearest neighbors for $x_i$ and $x_j$ to
be connected. For a fixed scale parameter $\epsilon > 0$, the weight associated with the two points $x_i$
and $x_j$ satisfies

$$w_{ij} = \begin{cases} \exp\{-\|x_i - x_j\|^2/\epsilon\} & \text{if } x_i \text{ and } x_j \text{ are connected}, \\ 0 & \text{otherwise}. \end{cases}$$
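The construction of the weight matrix can be sketched in a few lines. This is a minimal illustration, assuming a dense distance computation (suitable only for moderate $n$); the function name `knn_heat_weights` and the toy data are hypothetical.

```python
import numpy as np

def knn_heat_weights(X, k, eps):
    """Symmetric k-NN adjacency with heat-kernel weights.

    X is d x n. Points i and j are connected if either is among the other's
    k nearest neighbors; connected pairs get exp(-||x_i - x_j||^2 / eps).
    """
    n = X.shape[1]
    sq = (X ** 2).sum(axis=0)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X.T @ X, 0.0)
    np.fill_diagonal(D2, np.inf)                     # a point is not its own neighbor
    nbrs = np.argsort(D2, axis=1)[:, :k]             # indices of the k nearest points
    adj = np.zeros((n, n), dtype=bool)
    adj[np.repeat(np.arange(n), k), nbrs.ravel()] = True
    adj = adj | adj.T                                # "either ... or" rule from the text
    return np.where(adj, np.exp(-D2 / eps), 0.0)

# Four points on a line: 0, 1, 2.5, 10 (no ties in the neighbor search).
X = np.array([[0.0, 1.0, 2.5, 10.0]])
W = knn_heat_weights(X, k=1, eps=1.0)
```

In this toy example the last point is nobody's nearest neighbor, yet it is still connected to the point at 2.5 because the symmetric rule only requires one direction.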

III. Classification Constrained Dimensionality Reduction 

A. Statistical framework 

To put the problem in a classification context, we consider the following model. Let
$\mathcal{X}_n = \{x_1, x_2, \ldots, x_n\}$ be a set of $n$ points sampled from an $m$-dimensional submanifold $\mathcal{M} \subset \mathbb{R}^d$.
Each point $x_i \in \mathcal{M}$ is associated with a class label $c_i \in \mathcal{A} = \{0, 1, 2, \ldots, L\}$, where $c_i = 0$
corresponds to the case of unlabeled data. We assume that pairs $(x_i, c_i) \in \mathcal{M} \times \mathcal{A}$ are drawn i.i.d.
from a joint distribution

$$P(x, c) = p_x(x|c) P_c(c) = P_c(c|x) p_x(x), \qquad (5)$$

where $p_x(x) > 0$ and $p_x(x|c) > 0$ (for $x \in \mathcal{M}$) are the marginal and the conditional probability
density functions, respectively, satisfying $\int_{\mathcal{M}} p_x(x)\,dx = 1$ and $\int_{\mathcal{M}} p_x(x|c)\,dx = 1$, and $P_c(c) > 0$
and $P_c(c|x) > 0$ are the a priori and a posteriori probability mass functions of the class label,
respectively, satisfying $\sum_c P_c(c) = 1$ and $\sum_c P_c(c|x) = 1$. While we treat unlabeled points
of the form $(x_i, 0)$ similarly to labeled points, we still make the following distinction. Consider the
following mechanism for generating an unlabeled point. First, a class label $c \in \{1, 2, \ldots, L\}$
is generated from the labeled a priori probability mass function $\tilde{P}_c(c) = P(c \,|\, c \text{ is labeled}) =
P_c(c) / \sum_{c'=1}^{L} P_c(c')$. Then $x_i$ is generated according to $p_x(x|c)$. To treat $c$ as an unobserved
label, we marginalize $P(x, c \,|\, c \text{ is labeled}) = p_x(x|c) \tilde{P}_c(c)$ over $c$:

$$p_x(x|c = 0) = \sum_{q=1}^{L} p_x(x|c = q) \tilde{P}_c(q) = \frac{\sum_{q=1}^{L} p_x(x|c = q) P_c(q)}{\sum_{c'=1}^{L} P_c(c')}. \qquad (6)$$

This suggests that the conditional PDF of unlabeled points $p_x(x|c = 0)$ is uniquely determined by
the class priors and the conditionals for labeled points. We would like to point out that this is one
of a few treatments that can be offered for unlabeled points. For example, in anomaly detection,
one may want to associate the unlabeled points with contaminated data points, which can be
represented as a density mixture of $p_x(x|c = 0)$ and $\gamma(x)$ (e.g., $\gamma(x)$ is uniform in $\mathcal{X}$).

In classification constrained dimensionality reduction, our goal is to obtain a lower-dimensional
embedding $\mathcal{Y}_n = \{y_1, y_2, \ldots, y_n\}$ (where $y_i \in \mathbb{R}^m$ with $m < d$) that preserves local geometry
and that encourages clustering of points of the same class label. Alternatively, we would like
to find a mapping $f(x, c) : \mathcal{M} \times \mathcal{A} \to \mathbb{R}^m$ for which $y_i = f(x_i, c_i)$ that is smooth and that
clusters points of the same label.

We introduce the class label indicator for data point $x_i$ as $c_{ki} = I(c_i = k)$, for $k = 1, 2, \ldots, L$
and $i = 1, 2, \ldots, n$. Note that when point $x_i$ is unlabeled, $c_{ki} = 0$ for all $k$. Using the class
indicator, we can write the number of points in class $k$ as $n_k = \sum_{i=1}^{n} c_{ki}$. If all points are labeled,
then $n = \sum_{k=1}^{L} n_k$.
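As a small worked example of the indicator notation, the sketch below builds the $L \times n$ matrix with entries $c_{ki}$ from a label vector in which 0 marks an unlabeled point. The function name `class_indicator` and the label vector are illustrative assumptions.

```python
import numpy as np

def class_indicator(labels, L):
    """Return the L x n matrix C with C[k-1, i] = I(c_i = k); label 0 means unlabeled."""
    labels = np.asarray(labels)
    return (labels[None, :] == np.arange(1, L + 1)[:, None]).astype(float)

labels = [1, 0, 2, 2, 3, 0, 1]        # n = 7 points, two of them unlabeled
C = class_indicator(labels, L=3)
n_k = C.sum(axis=1)                   # number of points in each class
n_labeled = int(n_k.sum())            # equals n only if every point is labeled
```

Columns of `C` that correspond to unlabeled points are all zero, so $\sum_k n_k$ counts only the labeled points.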

B. Linear dimensionality reduction for classification 

1) LDA: Restricting the discussion to linear maps, one can extend PCA to take into account
label information using the multi-class extension to Fisher's linear discriminant analysis (LDA).
Instead of maximizing the trace of the data covariance matrix, LDA maximizes the ratio of the
between-class covariance to the within-class covariance. In other words, we obtain a linear transformation
$y_i = f(x_i, c_i) = A x_i$ with matrix $A$ that is the solution to the following maximization:

$$\max_A \operatorname{tr}\{A C_B A^T\} \quad \text{s.t.} \quad A C_W A^T = I, \qquad (7)$$

where

$$C_B = \frac{1}{n} \sum_{k=1}^{L} n_k \, (\bar{x}^{(k)} - \bar{x})(\bar{x}^{(k)} - \bar{x})^T$$

is the between-class covariance matrix, $\bar{x}^{(k)} = \sum_{i=1}^{n} c_{ki} x_i / n_k$ is the $k$th class center,
$\bar{x} = \sum_{i=1}^{n} x_i / n$ is the center point of the dataset,

$$C_W = \frac{1}{n} \sum_{k=1}^{L} n_k \, C_W^{(k)}$$

is the within-class covariance, and

$$C_W^{(k)} = \frac{\sum_{i=1}^{n} c_{ki} \, (x_i - \bar{x}^{(k)})(x_i - \bar{x}^{(k)})^T}{n_k}$$

is the within-class-$k$ covariance matrix. In Fig. 1, LDA selects an embedding that projects the
data onto $e_2$, since the maximum distance between classes is achieved along with a minimum
class variance when projecting the data onto $e_2$. We are interested in exploring a strategy that
maximizes class separation in the lower dimensional embedding.
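The contrast between PCA and LDA in the situation of Fig. 1 can be reproduced on synthetic data. The sketch below is illustrative only: it assumes all points are labeled, the $n_k/n$ weighting of the class covariances, and a positive definite $C_W$; the helper names and the toy data are hypothetical.

```python
import numpy as np

def lda_covariances(X, labels, L):
    """Between-class and within-class covariance matrices for a d x n data matrix X."""
    labels = np.asarray(labels)
    d, n = X.shape
    x_bar = X.mean(axis=1, keepdims=True)
    C_B, C_W = np.zeros((d, d)), np.zeros((d, d))
    for k in range(1, L + 1):
        Xk = X[:, labels == k]
        n_k = Xk.shape[1]
        xk_bar = Xk.mean(axis=1, keepdims=True)
        C_B += (n_k / n) * (xk_bar - x_bar) @ (xk_bar - x_bar).T
        C_W += (n_k / n) * ((Xk - xk_bar) @ (Xk - xk_bar).T / n_k)
    return C_B, C_W

def lda_projection(C_B, C_W, m):
    """Solve (7) by whitening with C_W^{-1/2}; assumes C_W is positive definite."""
    w_vals, w_vecs = np.linalg.eigh(C_W)
    W_inv_sqrt = w_vecs @ np.diag(1.0 / np.sqrt(w_vals)) @ w_vecs.T
    vals, vecs = np.linalg.eigh(W_inv_sqrt @ C_B @ W_inv_sqrt)
    top = vecs[:, np.argsort(vals)[::-1][:m]]
    return (W_inv_sqrt @ top).T                      # m x d, with A C_W A^T = I

# Two classes: large variance along e1, class separation along e2 (as in Fig. 1).
rng = np.random.default_rng(3)
n_per = 100
spread = np.array([[5.0], [0.1]])
X = np.hstack([spread * rng.standard_normal((2, n_per)) + np.array([[0.0], [1.0]]),
               spread * rng.standard_normal((2, n_per)) + np.array([[0.0], [-1.0]])])
labels = np.array([1] * n_per + [2] * n_per)

C_B, C_W = lda_covariances(X, labels, L=2)
Xc = X - X.mean(axis=1, keepdims=True)
C_x = Xc @ Xc.T / X.shape[1]
A_lda = lda_projection(C_B, C_W, m=1)
lda_dir = A_lda[0] / np.linalg.norm(A_lda[0])
pca_dir = np.linalg.eigh(C_x)[1][:, -1]              # eigenvector of the largest eigenvalue
```

With this weighting the total covariance decomposes as $C_x = C_B + C_W$, and on this toy data the PCA direction is essentially $e_1$ while the LDA direction is essentially $e_2$.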

2) Marginal Fisher Analysis: Recent work [15] presents marginal Fisher analysis (MFA),
which is a method that minimizes the ratio between intraclass compactness and interclass
separability. In its basic formulation MFA is a linear embedding, in which $y_i = A x_i$. Another
aspect of the method is that it considers two classes. The kernel trick is used to provide a
nonlinear extension to MFA. To construct the cost function, two quantities are of interest:
intraclass compactness and interclass separability. The intraclass compactness can be written
as

$$\sum_{ij} w_{ij} \, \|y_i - y_j\|^2, \qquad (8)$$

where $w_{ij}$ is given by

$$w_{ij} = \Big( \sum_k c_{ki} c_{kj} \Big) \, I\big(x_i \in N_k^{+}(x_j) \ \text{or} \ x_j \in N_k^{+}(x_i)\big) \qquad (9)$$

and $N_k^{+}(x)$ denotes the $k$-nn neighborhood of $x$ within the same class as $x$. Note that the term
$\sum_k c_{ki} c_{kj}$ is one if $x_i$ and $x_j$ have the same label and zero otherwise. Similarly, the interclass
separability can be written as

$$\sum_{ij} \tilde{w}_{ij} \, \|y_i - y_j\|^2, \qquad (10)$$

where $\tilde{w}_{ij}$ is given by

$$\tilde{w}_{ij} = \Big( 1 - \sum_k c_{ki} c_{kj} \Big) \, I\big(x_i \in N_k^{-}(x_j) \ \text{or} \ x_j \in N_k^{-}(x_i)\big) \qquad (11)$$

and $N_k^{-}(x)$ denotes the $k$-nn neighborhood of $x$ outside the class of $x$.

IV. Dimensionality reduction for classification on nonlinear manifolds 

Here, we review the CCDR algorithm [14] and its extension to multi-class classification.
To cluster lower-dimensional embedded points of the same label, we associate each class with
a class center, namely $z_k \in \mathbb{R}^m$. We construct the following cost function:

$$J(\mathcal{Z}_L, \mathcal{Y}_n) = \sum_{ki} c_{ki} \, \|z_k - y_i\|^2 + \frac{\beta}{2} \sum_{ij} w_{ij} \, \|y_i - y_j\|^2, \qquad (12)$$

where $\mathcal{Z}_L = \{z_1, \ldots, z_L\}$ and $\beta > 0$ is a regularization parameter. We consider two terms on
the RHS of (12). The first term corresponds to the concentration of points of the same label
around their respective class center. The second term is as in (4) or as in Laplacian Eigenmaps
[8] and controls the smoothness of the embedding over the manifold. Large values of $\beta$ produce
an embedding that ignores class labels and small values of $\beta$ produce an embedding that ignores
the manifold structure. For small $\beta$, training data points will tend to collapse into the class centers, allowing
many classifiers to produce perfect classification on the training data without being able to control
the generalization error (i.e., classification error of the unlabeled data). Our goal is to find $\mathcal{Z}_L$
and $\mathcal{Y}_n$ that minimize the cost function in (12).

Let $C$ be the $L \times n$ class membership matrix with $c_{ki}$ as its $ki$-th element, $Z = [z_1, \ldots, z_L, y_1, \ldots, y_n]$,
$0$ be the $L \times L$ all-zeros matrix, and

$$G = \begin{bmatrix} 0 & C \\ C^T & \beta W \end{bmatrix}.$$

Minimization over $Z$ of the cost function in (12) can be expressed as

$$\min_{Z D \mathbf{1} = 0, \; Z D Z^T = I} \operatorname{tr}(Z L Z^T), \qquad (13)$$

where $D = \operatorname{diag}\{G\mathbf{1}\}$ and $L = D - G$. To prevent the lower-dimensional points and the
class centers from collapsing into a single point at the origin, the regularization $Z D Z^T = I$
is introduced. The second constraint $Z D \mathbf{1} = 0$ is constructed to prevent a degenerate solution,
e.g., $z_1 = \ldots = z_L = y_1 = \ldots = y_n$. This solution may occur since $\mathbf{1}$ is in the null-space of
the Laplacian operator $L$, i.e., $L\mathbf{1} = 0$. The solution to (13) can be expressed in terms of the
following generalized eigendecomposition

$$L^{(n)} u_k^{(n)} = \lambda_k^{(n)} D^{(n)} u_k^{(n)}, \qquad (14)$$

where $\lambda_k^{(n)}$ is the $k$th eigenvalue and $u_k^{(n)}$ is its corresponding eigenvector. Note that we include
the superscript $(n)$ to emphasize the dependence on the $n$ data points. Without loss of generality we assume
$\lambda_1 \le \lambda_2 \le \ldots \le \lambda_{n+L}$. Specifically, matrix $Z$ is given by $[u_2, u_3, \ldots, u_{m+1}]^T$, where the first $L$
columns correspond to the coordinates of the class centers, i.e., $z_k = Z e_k$, and the following $n$
columns determine the embedding of the $n$ data points, i.e., $y_i = Z e_{L+i}$. We use $e_i$ to denote
the canonical vector such that $[e_i]_s = 1$ for element $s = i$ and zero otherwise.
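The steps from (12) to (14) can be sketched numerically. The following is a minimal illustration rather than a reference implementation: it uses only NumPy, solves the generalized problem through the symmetric normalization $D^{-1/2} L D^{-1/2}$ (an implementation choice, assuming every node has positive degree and the graph is connected), and builds $W$ from a dense heat kernel instead of a $k$-nearest neighbors graph. The name `ccdr_embed` and the toy data are hypothetical.

```python
import numpy as np

def ccdr_embed(W, C, beta, m):
    """Solve (14) and return class centers (m x L), embedded points (m x n), eigenvalues."""
    L_cls, n = C.shape
    G = np.block([[np.zeros((L_cls, L_cls)), C], [C.T, beta * W]])
    deg = G.sum(axis=1)                              # diagonal of D = diag(G 1)
    Lap = np.diag(deg) - G                           # graph Laplacian L = D - G
    d_inv_sqrt = 1.0 / np.sqrt(deg)                  # assumes positive degrees
    S = d_inv_sqrt[:, None] * Lap * d_inv_sqrt[None, :]
    eigvals, V = np.linalg.eigh(S)                   # ascending eigenvalues
    U = d_inv_sqrt[:, None] * V                      # generalized eigenvectors, U^T D U = I
    Z = U[:, 1:m + 1].T                              # skip the constant eigenvector u_1
    return Z[:, :L_cls], Z[:, L_cls:], eigvals[1:m + 1], deg

# Two clusters in the plane; the first 10 points of each are labeled, the rest are not.
rng = np.random.default_rng(2)
n_per = 15
X = np.hstack([rng.normal(0.0, 0.3, (2, n_per)),
               rng.normal(0.0, 0.3, (2, n_per)) + np.array([[2.0], [0.0]])])
labels = np.array([1] * 10 + [0] * 5 + [2] * 10 + [0] * 5)
C = (labels[None, :] == np.arange(1, 3)[:, None]).astype(float)

d2 = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
W = np.exp(-d2 / 1.0)
np.fill_diagonal(W, 0.0)

Zc, Y, lam, deg = ccdr_embed(W, C, beta=1.0, m=1)
Z_full = np.hstack([Zc, Y])                          # Z = [z_1, ..., z_L, y_1, ..., y_n]
```

On this toy example the one-dimensional embedding places the two class centers on opposite sides of the origin, and both constraints of (13) hold up to round-off.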
A. Classification and computational complexity

In classification, the goal is to find a classifier $a_x(x) : \mathcal{M} \to \mathcal{A}$ based on the training data
that minimizes the generalization error:

$$\hat{a} = \arg\min_{a \in \mathcal{F}} E[I(a(x) \ne c)], \qquad (15)$$

where the expectation is taken w.r.t. the pair $(x, c)$. Since only samples from the joint distribution
of $x$ and $c$ are available, we replace the expectation with a sample average w.r.t. the training
data, $\frac{1}{n} \sum_{i=1}^{n} I(a(x_i) \ne c_i)$. During the minimization, we search over a set of classifiers
$a_x(x) : \mathcal{M} \subset \mathbb{R}^d \to \mathcal{A}$, which is defined over a domain in $\mathbb{R}^d$. In our framework, we suggest replacing
a classifier $a_x(x) : \mathcal{M} \subset \mathbb{R}^d \to \mathcal{A}$ with dimensionality reduction via CCDR, $f(x) : \mathcal{M} \subset
\mathbb{R}^d \to \mathbb{R}^m$, followed by a classifier on the lower-dimensional space $a_y(y) : \mathbb{R}^m \to \mathcal{A}$, i.e.,
$a_x = a_y \circ f$. The first advantage is that the search space for the minimization in (15), defined
over a $d$-dimensional space, can be reduced to an $m$-dimensional space. This results in significant
savings in computational complexity if the complexity associated with the process of obtaining
$f$ can be made low. In general, the classifier set $\mathcal{F}$ has to be rich enough to attain a lower
generalization error. The other advantage of our method lies in the fact that CCDR is designed
to cluster points of the same label, thus allowing for a linear classifier or other low-complexity
classifiers. Therefore, further reduction in the size of class $\mathcal{F}$ can be achieved in addition to the
reduction due to a lower-dimensional domain. To classify a new data point, one has to apply
CCDR to the new data point. If this is done by brute force, the point is added to the set of training
points with no label, a new matrix $W$ is formed, and an eigendecomposition is carried out.

When performing CCDR, each of the $n(n-1)/2$ terms of the form $\|x_i - x_j\|^2$ requires
one summation and $d$ multiplications, leading to computational complexity of the order $O(dn^2)$.
Construction of a $k$-nearest neighbors graph requires $O(kn)$ comparisons per point and therefore
a total of $O(kn^2)$. The total number of operations involved in constructing the graph is therefore
$O((k + d)n^2)$. Next, an eigendecomposition is applied to the $(L + n) \times (L + n)$ matrices in (14).
The associated computational complexity is $O(n^3)$. Therefore, the overall computational
complexity of CCDR is $O(n^3)$. This holds for both training and classification as explained
earlier. We are interested in reducing computational complexity in training the classifier and in
classification. For that purpose, we consider an out-of-sample extension of CCDR.



February 21, 2008 



DRAFT 






V. Out-of-Sample Extension 
We start by rearranging the generalized eigendecomposition of the Laplacian in (14) as

$$G^{(n)} u_l^{(n)} = (1 - \lambda_l^{(n)}) D^{(n)} u_l^{(n)}, \qquad (16)$$

and recall that $u_l^{(n)} = [z_1(l), z_2(l), \ldots, z_L(l), y_1(l), y_2(l), \ldots, y_n(l)]^T$. Since we consider an
$m$-dimensional embedding, we are only interested in eigenvectors $u_2, \ldots, u_{m+1}$. The $(L + i)$th equation
(row) for $i = 1, 2, \ldots, n$ in the eigendecomposition in (16) can be written as

$$y_i^{(n)}(l) = \frac{1}{1 - \lambda_l^{(n)}} \, \frac{I(c_i \ne 0) \, z_{c_i}^{(n)}(l) + \beta \sum_j w_{ij} \, y_j^{(n)}(l)}{I(c_i \ne 0) + \beta \sum_j w_{ij}}. \qquad (17)$$

Similarly, the $k$th equation (row) of (16) for $k = 1, 2, \ldots, L$ is given by

$$z_k^{(n)}(l) = \frac{1}{1 - \lambda_l^{(n)}} \, \frac{\sum_i c_{ki} \, y_i^{(n)}(l)}{n_k}. \qquad (18)$$

Our interest is in finding a mapping $f(x, c)$ that, in addition to mapping every $x_i$ to $y_i$, can
perform an out-of-sample extension, i.e., is well-defined outside the set $\mathcal{X}$. We consider the
following out-of-sample extension expression

$$f_l^{(n)}(x, c) = \frac{1}{1 - \lambda_l^{(n)}} \, \frac{I(c \ne 0) \, z_c^{(n)}(l) + \beta \sum_j K(x, x_j) \, y_j^{(n)}(l)}{I(c \ne 0) + \beta \sum_j K(x, x_j)}, \qquad (19)$$

where $z_c^{(n)}(l)$ is the same as in (18). This formula can be explained as follows. First, the
lower-dimensional embedding $y_1^{(n)}, \ldots, y_n^{(n)}$ and the class centers $z_1^{(n)}, \ldots, z_L^{(n)}$ are obtained through
the eigendecomposition in (16). Then, the embedding outside the sample set $\mathcal{X}$ is calculated
via (19). By comparison of $f_l^{(n)}(x_i, c_i)$ evaluated through (19) with (17), we have $f_l^{(n)}(x_i, c_i) =
y_i^{(n)}(l)$. This suggests that the out-of-sample extension coincides with the solution we already
have for the mapping at the data points $\mathcal{X}$. Moreover, using this result one can replace all
$y_j^{(n)}(l)$ with $f_l^{(n)}(x_j, c_j)$ in (19) and obtain the following generalization of the eigendecomposition
in (14):

$$f_l^{(n)}(x, c) = \frac{1}{1 - \lambda_l^{(n)}} \, \frac{I(c \ne 0) \, z_c^{(n)}(l) + \beta \sum_j K(x, x_j) \, f_l^{(n)}(x_j, c_j)}{I(c \ne 0) + \beta \sum_j K(x, x_j)} \qquad (20)$$

and

$$z_k^{(n)}(l) = \frac{\sum_i c_{ki} \, f_l^{(n)}(x_i, c_i)}{(1 - \lambda_l^{(n)}) \, n_k}. \qquad (21)$$

In [18], it is proposed that if the out-of-sample solution to the eigendecomposition problem
associated with kernel PCA converges, it is given by the solution to the asymptotic equivalent
of the eigendecomposition. Using similar machinery, we can provide a similar result suggesting
that if $f_l^{(n)}(x, c) \to f_l^{(\infty)}(x, c)$ as $n \to \infty$, then the asymptotic equivalents to (20) and (21)
should provide the solution to the limit of $f_l^{(n)}(x, c)$. The asymptotic analogues to (17) and
(18) are described in the following. The mapping for labeled data $f_l(x, c) : \mathcal{M} \times \mathcal{A} \to \mathbb{R}$ for
$c = 0, 1, 2, \ldots, L$ equivalent to equation (17) is

$$f_l(x, c) = \frac{1}{1 - \lambda_l} \, \frac{I(c \ne 0) \, z_c(l) + \beta' \sum_{c'=0}^{L} \int_{\mathcal{M}} K(x, x') f_l(x', c') P(x', c') \, dx'}{I(c \ne 0) + \beta' \int_{\mathcal{M}} K(x, x') p_x(x') \, dx'}, \qquad (22)$$

where $z_c(l)$ for $c = 1, 2, \ldots, L$ equivalent to (18) is

$$z_c(l) = \frac{\int_{\mathcal{M}} f_l(x, c) \, p_x(x|c) \, dx}{1 - \lambda_l}, \qquad (23)$$

and $\beta' = \beta n$. Since we are interested in an $m$-dimensional embedding, we consider only $l =
1, 2, \ldots, m$, i.e., the eigenvectors that correspond to the $m$ smallest eigenvalues. To guarantee
that the relevant eigenvectors are unique (up to a multiplicative constant), we require $\lambda_1 < \lambda_2 <
\cdots < \lambda_{m+1} < \lambda_{m+2} \le \cdots \le \lambda_n$.

The out-of-sample extension given by (19) can be useful in a couple of scenarios. The first
is in classification of new unlabeled samples. We assume that $\{y_i\}_{i=1}^{n}$, $\{z_k\}_{k=1}^{L}$, and $\{\lambda_l\}_{l=1}^{m}$
are already obtained based on labeled (or partially labeled) training data and we would like to
embed a new unlabeled data point. We consider using (19) with $c = 0$, i.e., we can use $f(x, 0)$ to
map a new sample $x$ to $\mathbb{R}^m$. The obvious immediate advantage is the savings in computational
complexity, as we avoid performing an additional eigendecomposition that includes the new point.
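The first scenario can be illustrated with a short sketch of (19). Several choices here are assumptions made for the illustration and are not taken from the text: the kernel $K$ is a dense heat kernel, $W$ is set to the kernel matrix including its diagonal so that $w_{ij} = K(x_i, x_j)$ holds exactly, and the names `heat_kernel`, `ccdr_fit`, and `ccdr_out_of_sample` are hypothetical.

```python
import numpy as np

def heat_kernel(A, B, eps):
    """K[i, j] = exp(-||a_i - b_j||^2 / eps) for columns a_i of A and b_j of B."""
    d2 = ((A[:, :, None] - B[:, None, :]) ** 2).sum(axis=0)
    return np.exp(-d2 / eps)

def ccdr_fit(X, labels, L_cls, beta, m, eps):
    """Training step: solve (16) and return class centers, embedding, and eigenvalues."""
    C = (labels[None, :] == np.arange(1, L_cls + 1)[:, None]).astype(float)
    W = heat_kernel(X, X, eps)                       # assumption: w_ij = K(x_i, x_j), all i, j
    G = np.block([[np.zeros((L_cls, L_cls)), C], [C.T, beta * W]])
    deg = G.sum(axis=1)
    s = 1.0 / np.sqrt(deg)
    eigvals, V = np.linalg.eigh(s[:, None] * (np.diag(deg) - G) * s[None, :])
    Z = (s[:, None] * V)[:, 1:m + 1].T
    return Z[:, :L_cls], Z[:, L_cls:], eigvals[1:m + 1]

def ccdr_out_of_sample(x, c, X, Zc, Y, lam, beta, eps):
    """Evaluate (19) for a point x with label c (c = 0 means unlabeled)."""
    kx = heat_kernel(x.reshape(-1, 1), X, eps)[0]    # K(x, x_j) for j = 1..n
    labeled = 1.0 if c != 0 else 0.0
    z_c = Zc[:, c - 1] if c != 0 else np.zeros(Zc.shape[0])
    num = labeled * z_c + beta * (Y @ kx)
    den = labeled + beta * kx.sum()
    return num / ((1.0 - lam) * den)

rng = np.random.default_rng(4)
n_per = 15
X = np.hstack([rng.normal(0.0, 0.3, (2, n_per)),
               rng.normal(0.0, 0.3, (2, n_per)) + np.array([[2.0], [0.0]])])
labels = np.array([1] * 10 + [0] * 5 + [2] * 10 + [0] * 5)
beta, eps = 1.0, 1.0

Zc, Y, lam = ccdr_fit(X, labels, 2, beta, 1, eps)
# Re-embed the training points through (19); this should reproduce Y.
Y_check = np.array([ccdr_out_of_sample(X[:, i], labels[i], X, Zc, Y, lam, beta, eps)
                    for i in range(X.shape[1])]).T
# Embed a new unlabeled point located at the center of the first cluster.
y_new = ccdr_out_of_sample(np.array([0.0, 0.0]), 0, X, Zc, Y, lam, beta, eps)
```

Under these assumptions the extension reproduces the training embedding, and the new unlabeled point lands on the same side of the origin as the center of the class whose cluster it belongs to, without a new eigendecomposition.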

The second scenario involves the out-of-sample extension for labeled data. The goal here is
not to classify the data, since the label is already available. Instead, we are interested in the
training phase in the case of large $n$ for which the eigendecomposition is infeasible. In this
case, a large amount of labeled training data is available, but due to the heavy computational
complexity associated with the eigendecomposition in (14) (or in (16)), the data cannot be
processed. In this case, we are interested in developing a resampling method, which integrates
$f_l^{(n)}(x, c)$ obtained for different subsamples of the complete data set.

A. Classification Algorithms

We consider three widespread algorithms: k-nearest neighbors, linear classification, and neural
networks. A standard implementation of k-nearest neighbors was used, see [1, p. 415]. The linear
classifier we implemented is given by

$$\hat{c}(y) = \arg\max_{c \in \{A_1, \ldots, A_L\}} \; y^T \alpha^{(c)} + \alpha_0^{(c)},$$

$$[\alpha^{(k)}, \alpha_0^{(k)}] = \arg\min_{[\alpha, \alpha_0]} \sum_{i=1}^{n} \left( y_i^T \alpha + \alpha_0 - c_{ki} \right)^2$$

for $k = 1, \ldots, L$. The neural network we implemented is a three-layer neural network with $d$
elements in the input layer, $2d$ elements in the hidden layer, and 6 elements in the output layer
(one for each class). Here $d$ was selected using the common PCA procedure, as the smallest
dimension that explains 99.9% of the energy of the data. A gradient method was used to train the
network coefficients with 2000 iterations. The neural net is significantly more computationally
burdensome than either the linear or the k-nearest neighbors classification algorithms.
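The linear classifier defined above amounts to one least-squares regression per class onto the class indicator, followed by an arg max over the resulting scores. A minimal sketch follows; the function names and the synthetic three-class data are illustrative assumptions, and `numpy.linalg.lstsq` is used in place of whatever solver the original implementation relied on.

```python
import numpy as np

def fit_linear_classifier(Y, labels, L):
    """Least-squares fit of [alpha^(k), alpha_0^(k)] for k = 1..L.

    Y is m x n (embedded points as columns); labels take values in {1, ..., L}.
    Returns alpha (m x L) and alpha0 (length L).
    """
    labels = np.asarray(labels)
    n = Y.shape[1]
    Phi = np.hstack([Y.T, np.ones((n, 1))])                      # rows [y_i^T, 1]
    T = (labels[:, None] == np.arange(1, L + 1)[None, :]).astype(float)  # T[i, k-1] = c_ki
    coef, *_ = np.linalg.lstsq(Phi, T, rcond=None)
    return coef[:-1], coef[-1]

def predict(Y, alpha, alpha0):
    """Assign each column of Y to the class with the largest linear score."""
    scores = Y.T @ alpha + alpha0
    return scores.argmax(axis=1) + 1

# Three well-separated clusters in the plane, one per class.
rng = np.random.default_rng(5)
centers = np.array([[0.0, 4.0, 2.0], [0.0, 0.0, 4.0]])
labels = np.repeat(np.arange(1, 4), 30)
Y = centers[:, labels - 1] + 0.2 * rng.standard_normal((2, labels.size))

alpha, alpha0 = fit_linear_classifier(Y, labels, L=3)
train_accuracy = (predict(Y, alpha, alpha0) == labels).mean()
```

When the classes are tightly clustered around distinct, non-collinear centers, as CCDR encourages for small $\beta$, this simple classifier separates the training data.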

B. Data Description 

In this section, we examine the performance of the classification algorithms on the benchmark
label classification problem provided by the Landsat MSS satellite imagery database [19]. Each
sample point consists of the intensity values of one pixel and its 8 neighboring pixels in 4
different spectral bands. The training data consists of 4435 36-dimensional points, of which
1072 are labeled as 1) red soil, 479 as 2) cotton crop, 961 as 3) grey soil, 415 as 4) damp
grey soil, 470 as 5) soil with vegetation stubble, and 1038 as 6) very
damp grey soil. The test data consists of 2000 36-dimensional points, of which 461 are labeled
as 1) red soil, 224 as 2) cotton crop, 397 as 3) grey soil, 211 as 4) damp grey soil, 237 as
5) soil with vegetation stubble, and 470 as 6) very damp grey soil. In
the following, each classifier is trained on the training data and its classification is evaluated
on the entire test sample. In Table I, we present the "best case" performance of neural
networks, the linear classifier, and k-nearest neighbors in three cases: no dimensionality reduction,
dimensionality reduction via PCA, and dimensionality reduction via CCDR. The table presents
the minimum probability of error achieved by varying the tuning parameters of the classifiers.
The benefit of using CCDR is evident, and we are prompted to further evaluate the performance
gains attained using CCDR.





                   Neural Net.   Lin.      k-nearest neigh.
No dim. reduc.     83 %          22.7 %    9.65 %
PCA                9.75 %        23 %      9.35 %
CCDR               8.95 %        8.95 %    8.1 %

TABLE I
Classification error probability



C. Regularization Parameter (3 

As mentioned earlier, the CCDR regularization parameter β controls the contribution of the 
label information versus the contribution of the geometry described by the sample. We apply 
CCDR to the 36-dimensional data to create a 14-dimensional embedding while varying β over a 
range of values. For justification of our choice of 14 dimensions, see Section IV-D. In the 
process of computing the weights Wij for the algorithm, we use k-nearest neighbors with k = 4 
to determine the local neighborhood. Fig. 2 shows the classification error probability (dotted 
line) for the linear classifier vs. β after preprocessing the data using CCDR with k = 4 and 
dimension 14. We observe that for a large range of β the average classification error probability 
is greater than 0.09 but smaller than 0.095. This performance competes with the performance 
of k-nearest neighbors applied to the high-dimensional data, which is presented in [1] as the 
leading classifier for this benchmark problem. Another observation is that for small values of 
β (i.e., β < 0.1) the probability of error is constant. For such small values of β, classes in 
the lower-dimensional embedding are well separated and well concentrated around the class 
centers. Therefore, the linear classifier yields perfect classification on the training set, and a 
fairly low, constant probability of error is attained on the test data. When β is 
increased, we notice an increase in the classification error probability. This is because 
the training data become non-separable by any linear classifier as β increases. 
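
The neighborhood construction mentioned above (k-nearest neighbors used to decide which weights Wij are nonzero) can be sketched as follows. This is a minimal illustration under our own assumptions: the weights are binary, and the graph is symmetrized by keeping an edge when either point is among the other's k nearest neighbors. The weights actually used by CCDR are those defined earlier in the paper and may differ; the function name `knn_weights` and the toy points are hypothetical.

```python
import math

def knn_weights(points, k):
    """Return a symmetric 0/1 matrix W where W[i][j] = 1 if j is among the
    k nearest neighbors of i, or i is among the k nearest neighbors of j."""
    n = len(points)
    W = [[0] * n for _ in range(n)]
    for i in range(n):
        # Order all other points by Euclidean distance to point i.
        order = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
        for j in order[:k]:
            W[i][j] = 1
            W[j][i] = 1  # symmetrization rule is an assumption of this sketch
    return W

# Toy data (ours): two well-separated clusters on a line.
pts = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,), (5.2,)]
W = knn_weights(pts, k=2)
```

With a small k, no edges connect the two clusters, so the graph only encodes local structure.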

We perform a similar study of classification performance for k-nearest neighbors. In Fig. 2, 
the classification error probability is plotted (dashed line) vs. β. Here, we observe that an average 
error probability of 0.086 can be achieved for β ≈ 0.5. Therefore, k-nearest neighbors preceded 
by CCDR outperforms the straightforward k-nearest neighbors algorithm. We also observe that 
when β is decreased the probability of error increases. This can be explained by the 
ability of k-nearest neighbors to utilize local information, i.e., local geometry. This information 
is discarded when β is decreased. 

We conclude that CCDR can generate lower-dimensional data that is useful for global classifiers, 
such as the linear classifier, by using a small value of β, and also for local classifiers, such 
as k-nearest neighbors, by using a larger value of β and thus preserving local geometry information. 



Fig. 2. Probability of incorrect classification vs. β for a linear classifier (dotted line o) and for the k-nearest neighbors algorithm 
(dashed line o) preprocessed by CCDR. 80% confidence intervals are presented as x for the linear classifier and as + for the 
k-nearest neighbors algorithm. 
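
The caption does not state how the 80% confidence intervals were computed. As one common possibility, shown purely as an assumption, a normal-approximation interval for an error proportion estimated from n independent test points can be sketched as below. The function name and the choice of interval are ours; the inputs 0.086 and 2000 are the k-nearest neighbors error and the test set size quoted in the text.

```python
import math
from statistics import NormalDist

def error_confidence_interval(p_hat, n, level=0.80):
    """Two-sided normal-approximation interval for an error probability
    p_hat estimated from n independent trials (an assumed method)."""
    # Quantile of the standard normal for the requested two-sided level.
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)

lo, hi = error_confidence_interval(0.086, 2000)
print(round(lo, 4), round(hi, 4))
```

If the intervals in the figure were instead obtained from repeated runs, an empirical quantile over the runs would be the natural replacement for this formula.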



D. Dimension Parameter 

While the data points in Xn may lie on a manifold of a particular dimension, the actual dimension 
required for classification may be smaller. Here, we examine classification performance 
as a function of the CCDR dimension. Using the entropic graph dimension estimation algorithm 
in [20], we obtain the following estimated dimension for each class: 



class        1    2    3    4    5    6 
dimension   13    7   13   10    6   13 



February 21, 2008 



DRAFT 



Fig. 3. Probability of incorrect classification vs. CCDR's dimension for a linear classifier (dotted line o) and for the k-nearest 
neighbors algorithm (dashed line o) preprocessed by CCDR. 80% confidence intervals are presented as x for the linear classifier 
and as + for the k-nearest neighbors algorithm. 



Therefore, if an optimal nonlinear embedding of the data could be found, we suspect that a 
dimension greater than 13 may not yield significant improvement in classification performance. 
Since CCDR does not necessarily yield an optimal embedding, we choose a CCDR embedding 
dimension of 14 in Section IV-C. 
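
The reasoning above can be written out as a small calculation using the per-class dimension estimates reported in this section. The margin of one extra dimension is our reading of the choice of 14, not a rule stated in the text, and the variable names are ours.

```python
# Per-class intrinsic dimension estimates as reported in the text.
estimated_dim = {1: 13, 2: 7, 3: 13, 4: 10, 5: 6, 6: 13}

# The largest per-class estimate bounds the dimension any single class needs.
max_dim = max(estimated_dim.values())

# Assumption of this sketch: allow one extra dimension because CCDR is not
# guaranteed to produce an optimal embedding.
margin = 1
embedding_dim = max_dim + margin

print(max_dim, embedding_dim)
```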

In Fig. 3, we plot the classification error probability (dotted line) vs. CCDR dimension and 
its confidence interval for a linear classifier. We observe a decrease in error probability as the 
dimension increases. When the CCDR dimension is greater than 5, the error probability seems 
fairly constant. This is an indication that a CCDR dimension of 5 is sufficient for classification if 
one uses the linear classifier with β = 0.5, i.e., the linear classifier cannot exploit geometry. 

We also plot the classification error probability (dashed line) vs. CCDR dimension and its 
confidence interval for the k-nearest neighbors classifier. Generally, we observe a decrease in error 
probability as the dimension increases. When the CCDR dimension is greater than 5, the error 
probability seems fairly constant. When the CCDR dimension is three, the classification error is 
below 0.1. On the other hand, the minimum probability of error is obtained at CCDR dimensions 
12-14. This is in remarkable agreement with the dimension estimate of 13 obtained using the 
entropic graph algorithm of [20]. 
Fig. 4. Probability of incorrect classification vs. CCDR's k-nearest neighbors parameter for a linear classifier (dotted line o) 
and for the k-nearest neighbors algorithm (dashed line o) preprocessed by CCDR. 80% confidence intervals are presented as x 
for the linear classifier and as + for the k-nearest neighbors algorithm. 



E. CCDR's k-Nearest Neighbors Parameter 

The last parameter we examine is CCDR's k-nearest neighbors parameter. In general, as 
k increases, non-local distances are included in the lower-dimensional embedding. Hence, a very 
large k prevents the flexibility necessary for dimensionality reduction on (globally) non-linear 
(but locally linear) manifolds. 
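
To illustrate why large k brings in non-local distances, the sketch below uses a toy manifold of our own choosing, a unit circle sampled at evenly spaced points, and compares the Euclidean (chord) distance with the along-manifold (arc) distance for the k nearest neighbors of one point. This is not the Landsat data, and the function name is hypothetical. A ratio close to 1 means the straight-line distance is a faithful local measurement; smaller ratios mean it cuts across the manifold.

```python
import math

def chord_to_arc_ratio(n_points, k):
    """Sample n_points evenly on a unit circle and return the smallest
    Euclidean-to-geodesic distance ratio among the k nearest neighbors
    of point 0."""
    angles = [2.0 * math.pi * i / n_points for i in range(n_points)]
    pts = [(math.cos(a), math.sin(a)) for a in angles]
    # Neighbors of point 0, ordered by Euclidean (chord) distance.
    order = sorted(range(1, n_points), key=lambda j: math.dist(pts[0], pts[j]))
    ratios = []
    for j in order[:k]:
        chord = math.dist(pts[0], pts[j])
        # Geodesic distance on the circle: the shorter of the two arcs.
        arc = min(angles[j], 2.0 * math.pi - angles[j])
        ratios.append(chord / arc)
    return min(ratios)

ratios_by_k = {k: chord_to_arc_ratio(100, k) for k in (2, 10, 40)}
for k, r in ratios_by_k.items():
    print(k, round(r, 4))
```

The ratio shrinks as k grows, which is the toy analogue of the loss of flexibility described above.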

In Fig. 4, the classification probability of error for the linear classifier (dotted line) is plotted 
vs. CCDR's k-nearest neighbors parameter. A minimum is obtained at k = 3 with a probability 
of error of 0.092. The classification probability of error for k-nearest neighbors (dashed line) is 
plotted vs. CCDR's k-nearest neighbors parameter. A minimum is obtained at k = 4 with a 
probability of error of 0.086. 

VI. Conclusion 

In this paper, we presented the CCDR algorithm for multiple classes. We examined the performance 
of various classification algorithms applied after CCDR for the Landsat MSS imagery 
dataset. We showed that for a linear classifier, decreasing β yields improved performance and for 
a k-nearest neighbors classifier, increasing β yields improved performance. We demonstrated that 
both classifiers perform better on the much lower-dimensional CCDR embedding 
space than when applied to the original high-dimensional data. We also explored the effect of k 
in the k-nearest neighbors construction of the CCDR weight matrix on classification performance. 
CCDR allows reduced-complexity classifiers, such as the linear classifier, to perform better 
than more complex classifiers applied to the original data. We are currently pursuing an out-of-sample 
extension to the algorithm that does not require rerunning CCDR on test and training 
data to classify a new test point. 